Abstract: Humans can perceive various object properties 
based solely on the sounds that the objects make when an action 
is performed on them. Similarly, robots in human-inhabited 
environments must be capable of learning and reasoning about 
the acoustic properties of the objects with which they interact. 
Such an ability would allow a robot to infer some object 
properties even if the object is not in direct line of sight. This 
paper presents a framework that allows a robot to infer the 
object with which it is interacting from the sounds generated by 
the object during the interaction. The framework is evaluated on 
a 7-DOF Barrett WAM robot which performs pushing, grasping, 
and dropping behaviors on 18 different objects. The results show 
that the robot is able to accurately recognize objects (e.g., bottles, 
cups, and balls) based on their acoustic properties. Furthermore, 
the recognition accuracy can be improved if the robot performs 
a combination of different exploratory behaviors on each object. 

I. Introduction 

Human beings have the remarkable ability to extract the 
physical properties of objects from the sounds that they 
produce [1, 2]. Unlike our sense of vision, which is always 
constrained to a particular viewing direction, our auditory 
sense allows us to infer events in the world that are often 
outside the reach or range of other sensory modalities [3]. 

Using sound as a source of information would undoubtedly 
help a robot detect and reason about events in a human- 
inhabited environment. For example, if a robot accidentally 
knocks over an object that is outside of its field of view, the 
sound generated by the object as it falls to the ground will be 
the only source of information about the nature of the object. A 
robot grasping an object out of sight (e.g., a toy in a box) will 
only have access to auditory and tactile information regarding 
the type of object it is interacting with. Similarly, if a human 
interacts with an object that is outside the robot's field of view, 
the robot can use the detected sounds to infer the nature of the 
object and the type of interaction. These types of situations 
clearly present a challenge to traditional object recognition 
frameworks which rely heavily on computer vision methods. 

This paper addresses the problem of how a robot can 
recognize the object it is interacting with based on the detected 
sounds produced by the object. We present a framework 
in which the robot learns compact predictive models that 
can estimate the object class given the robot's exploratory 
behavior and the resulting sounds. Three different algorithms 
representing distinct families of machine learning methods are 
evaluated: k-Nearest Neighbor (an instance-based method), 
Support Vector Machine (a discriminative learning algorithm), 
and a Bayesian Network (a probabilistic graphical model). 

Fig. 1. The 7-DOF Barrett whole arm manipulator used in the experiments. 
The figure also shows the microphone used to record the sounds. 

The robot used in the experiments is a 7-DOF Barrett WAM 
arm shown in Figure 1. The robot's behavioral repertoire 
consists of three different behaviors (pushing, dropping, and 
grasping), which it applies to all objects that it encounters. 
Eighteen different objects were used for performance evalu- 
ation, including a bottle, a pop can, a book, etc. The three 
learning algorithms were evaluated based on how well they 
can generalize to novel auditory data not available during 
the training stage. The results show that by performing a 
combination of behaviors, the robot is able to improve its 
acoustic-based object recognition performance, regardless of 
the type of learning algorithm that is used. 

II. Related Work 

Despite the vast amount of information conveyed by the 
acoustic properties of everyday objects, there have been rel- 
atively few studies investigating how a robot could perceive 
object properties using auditory information. One of the first 
such studies was conducted by Krotkov et al. [4] in which 
the task of the robot was to identify the material type (e.g., 
glass, wood, etc.) of different objects by probing them with 
its end effector. In that study, the robot used a hitting behavior 
to recognize five different materials: aluminum, brass, glass, 
wood, and plastic. The results indicate that the spectrogram of 
the detected sound can be used as a powerful representation 
for discriminating between the five materials [4]. Subsequent 
work by Klatzky et al. [5] shows that modeling frequency and 
decay parameters of sounds can also be used to build a sound 
model for each material. 

More recently, Richmond et al. [6] have proposed a robotic 
platform for automatic sound measurement of contact sounds. 
Contact sounds are defined as the sounds generated when the 
end effector of the robot strikes the surface of an object. 
In subsequent work [7], Richmond proposes modeling the 
spectrogram of the sounds using spectrogram averaging, in 
order to learn models for contact sounds induced when striking 
different types of materials. 

Torres-Jara et al. [8] demonstrate how a robot can recognize 
objects based on the sounds they make when tapped by 
the robot's hand. In that study the robot performs tapping 
behaviors on the objects within reach and records the detected 
sound spectrograms. When tapping a novel object, the robot 
matches the spectrogram of the detected sound to one that 
is already in its training set which results in a prediction for 
the object's type. The results show that the robot is able to 
recognize with high accuracy four different objects of varying 
materials by tapping. Their work is perhaps the first example 
of interactive object recognition using auditory information by 
a robot. 

Building on this prior work, this paper presents a framework in which the 
robot uses machine learning methods in order to perform 
auditory object recognition of 18 different objects using 3 
different behaviors. This paper also shows that by applying 
multiple different behaviors to an object, a robot could improve 
its auditory recognition performance regarding the object's 
type. 



III. Experimental Setup 



A. Robot 



The robot used in the experiments is a Barrett whole arm 
manipulator (WAM) with the 3-fingered Barrett hand as its 
end effector (see Figure 1). The robot arm has 7 degrees of 
freedom. The hand also has 7 degrees of freedom: two per 
finger, and one that controls the spread of fingers 1 and 
2. 

B. Exploratory Behaviors 

The robot uses three exploratory behaviors (grasp, push, and 
drop) to learn the acoustic properties of different objects. The 
behaviors were encoded using the teach and play interface 
provided by the Barrett WAM API. Figure 2 shows before 
and after images for each of the three behaviors, which are 
described in more detail below. 

1) Grasp behavior: The object is placed in front of the 
robot and the Barrett hand is positioned over it with fully 
outstretched fingers. Next, the command to close all fingers 
is executed resulting in the object being grasped by the hand. 
Figure 2.a shows an example of a grasp behavior performed 
on a whiteboard eraser, one of the eighteen objects used in 
the experiments. 




a) Example of a grasp behavior. 




b) Example of a push behavior. 




c) Example of a drop behavior. 

Fig. 2. Examples of the grasp, push, and drop behaviors used by the robot. 














Fig. 3. The eighteen objects used in the experiments. Top row: plastic bottle, 
plastic ball, rubber ball, tennis ball, plastic box, wooden plank; Second row: 
hockey puck, book, tin box, pop can, metal plate, soft plastic cup; Third row: 
paper box, eraser, metal flange, paper cup, hard plastic cup, wooden cube. 



2) Push behavior: The object is placed on the table and 
the robot arm executes a recorded trajectory that pushes the 
object sideways. During this behavior, the hand is placed in 
an open palm configuration. An example of the push behavior 
is shown in Figure 2.b. 

3) Drop behavior: The object is first grasped and then lifted 
to a pre-specified height above the table. Next, a command to 
open all three fingers is executed, resulting in the object falling 
and hitting the table. Figure 2.c shows the robot performing 
the drop behavior while holding the hockey puck object. 



C. Objects 

The set of objects, $\mathcal{O}$, that the robot interacts with consists 
of 18 different objects, as shown in Figure 3. The objects 
include different types of balls, cups, containers, a book, a 
bottle, a hockey puck, a whiteboard eraser, etc. The objects 
are made of varying materials including metal, plastic, rubber, 
paper, and wood. Some of the objects can be knocked down 
when pushed while others simply slide or roll. In addition, 
some of the objects bounce multiple times off the table when 
dropped (e.g., the three balls) while others don't. 

D. Sound Recording and Feature Extraction 

Sounds were recorded at a sampling rate of 44.1 kHz with 
16-bit depth through a Lexicon Alpha bus-powered 
audio interface. The audio was captured and segmented using 
the digital audio processing package Audacity. The 
microphone was a Rode NT1-A with a cardioid polar 
pattern and an average self noise of 5 dB. The microphone's 
output was routed to an ART Tube MP Studio microphone 
pre-amplifier, which supplied 48 V phantom power to the 
microphone. Sufficient gain was used on the pre-amplifier to 
provide a suitable input level for the recording input/output 
device. Signal levels were kept consistent across trials while 
maintaining enough headroom to prevent clipping. No audio 
compression was used on the recordings. 

During the grasping behavior, each sound is segmented such 
that it starts with the initiation of the grasp motor command 
and ends once the decibel level has dropped to that of the 
background noise. During the dropping behavior, each sound 
starts once the object hits the ground and ends once the volume 
level has dropped to that of the background noise. Finally, for 
the pushing behavior, the sound is segmented such that it starts 
when the hand makes its first contact with the object and ends 
once the dB level has returned to that of the background. For 
all three behaviors, the segmentation was done automatically 
by thresholding the dB level at the start and end of each sound. 
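
The thresholding step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the frame length, the 6 dB margin above the background, and the assumption that the first few frames contain only background noise are all hypothetical choices made for this example.

```python
import numpy as np

def segment_by_db_threshold(signal, frame_len=441, margin_db=6.0, noise_frames=10):
    """Return (start, end) sample indices of the span louder than the background.

    frame_len, margin_db, and noise_frames are illustrative assumptions,
    not values reported in the paper.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Root-mean-square level of each frame in dB (epsilon avoids log of zero).
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    level_db = 20.0 * np.log10(rms + 1e-12)
    # Background level estimated from the leading frames, assumed to be silent.
    noise_db = np.mean(level_db[:noise_frames])
    active = np.flatnonzero(level_db > noise_db + margin_db)
    if active.size == 0:
        return None  # nothing rose above the background level
    start = int(active[0]) * frame_len
    end = (int(active[-1]) + 1) * frame_len
    return start, end
```

With a 44.1 kHz recording, a frame of 441 samples corresponds to 10 ms, so the returned boundaries are accurate to within one frame.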

Sound features were extracted using the log-normalized 
Discrete Fourier Transform (DFT), which was computed for 
each sound using $2^6 + 1 = 65$ frequency bins. The SPHINX4 
natural language processing library was used to compute 
the DFT for each sound [9]. Next, given the DFT matrix 
for each sound, a 2-D histogram is computed by discretizing 
time into $k_t$ bins and frequencies into $k_f$ bins. The value for 
each bin in the histogram is set to the average of the values in 
the DFT matrix that fall into it. In all experiments conducted, 
$k_t$ was set to 10 and $k_f$ was set to 5. Hence, each sound is 
represented by a feature vector, $S$, where $S \in \mathbb{R}^{50}$ 
(since $k_t \times k_f = 50$). Figure 4 
shows an example of how the DFT of a sound is transformed 
into a 2-D histogram across time and frequency. 
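
The binning step can be sketched as follows. This is an illustrative reading of the description above, assuming the DFT is available as a (frequency x time) matrix and that bins are formed from nearly equal, contiguous index ranges; the function name and the equal-width binning are assumptions, not details taken from the paper.

```python
import numpy as np

def histogram_features(dft, k_t=10, k_f=5):
    """Average a (frequency x time) matrix into k_f x k_t bins and flatten it."""
    dft = np.asarray(dft, dtype=float)
    n_freq, n_time = dft.shape
    if n_freq < k_f or n_time < k_t:
        raise ValueError("matrix is smaller than the requested number of bins")
    # Bin edges split each axis into nearly equal, contiguous index ranges.
    f_edges = np.linspace(0, n_freq, k_f + 1).astype(int)
    t_edges = np.linspace(0, n_time, k_t + 1).astype(int)
    hist = np.empty((k_f, k_t))
    for i in range(k_f):
        for j in range(k_t):
            cell = dft[f_edges[i]:f_edges[i + 1], t_edges[j]:t_edges[j + 1]]
            hist[i, j] = cell.mean()  # bin value = average of the entries in it
    return hist.ravel()  # feature vector of length k_f * k_t
```

For a matrix with 65 frequency rows, the five frequency bins each cover 13 rows, and the resulting vector has 50 entries regardless of how long the sound is.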

E. Data Collection 

Let $\mathcal{B} = \{grasp, push, drop\}$ denote the set of exploratory 
behaviors. For each of the three behaviors, the robot performs 
six trials for each of the eighteen objects, resulting in a total of 
$3 \times 6 \times 18 = 324$ recorded trials. During the $i$-th trial, the robot 
records a data triple of the form $(B_i, O_i, S_i)$, where $B_i \in \mathcal{B}$ 
is the behavior executed, $O_i \in \mathcal{O}$ is the object on which the 
behavior was performed, and $S_i \in \mathbb{R}^{50}$ is the feature vector 
extracted from the detected sound. Each triple indicates that 
the sound features $S_i$ were detected when performing behavior 
$B_i$ on object $O_i$. Given such data, the task of the robot is to 
learn a model that can predict the object $O_i$ in the interaction 
given the behavior $B_i$ and sound features $S_i$. The next section 
describes the learning framework used to solve this task. 

b) DFT of sound wave 

c) 2-D Histogram of DFT 

Fig. 4. Example feature extraction from the sound generated by applying 
the grasp behavior on the pop can object. The raw sound wave is shown in 
a), where the horizontal axis denotes time and the vertical axis denotes dB 
level. The Discrete Fourier Transform is shown in b). The resulting set of 
features is shown in c) as a 2-D histogram. In both b) and c) the horizontal 
axis denotes time, while the vertical axis denotes frequency. 
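
As a rough illustration of the bookkeeping implied by the data triples, the sketch below enumerates the trial schedule and shows how the per-behavior training sets could be pulled out. The `Trial` type and the `training_set` helper are hypothetical names introduced only for this example; the object names are those listed in the caption of Fig. 3.

```python
from itertools import product
from typing import List, NamedTuple, Sequence, Tuple

class Trial(NamedTuple):
    """One recorded data triple (behavior, object, sound features)."""
    behavior: str
    obj: str
    features: Sequence[float]

BEHAVIORS = ["grasp", "push", "drop"]
OBJECTS = [
    "plastic bottle", "plastic ball", "rubber ball", "tennis ball",
    "plastic box", "wooden plank", "hockey puck", "book", "tin box",
    "pop can", "metal plate", "soft plastic cup", "paper box", "eraser",
    "metal flange", "paper cup", "hard plastic cup", "wooden cube",
]
TRIALS_PER_PAIR = 6

# Every (behavior, object, repetition) combination that is recorded.
schedule = list(product(BEHAVIORS, OBJECTS, range(TRIALS_PER_PAIR)))

def training_set(trials: List[Trial], behavior: str) -> List[Tuple[Sequence[float], str]]:
    """Select the (features, object) pairs recorded with one behavior."""
    return [(t.features, t.obj) for t in trials if t.behavior == behavior]
```

Keeping the behavior in each record makes it straightforward to train one model per behavior, which is how the learning problem is set up in the next section.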

IV. Learning Methodology 
A. Problem Formulation 

The learning task of the robot is formulated as follows: for 
each exploratory behavior $B \in \mathcal{B}$, learn a model $M_B$ such 
that $M_B(S_i) \to O_i$, where $O_i$ is the object present during 
the interaction that generated the sound feature vector $S_i$. 
In other words, the robot needs to learn a model that can 
predict the object class given the detected sound features. 
More specifically, the model should be able to estimate the 
conditional probability $\Pr_B(O_i = o \mid S_i)$ for each object 
$o \in \mathcal{O}$ and each behavior $B \in \mathcal{B}$, given the detected 
sound feature vector $S_i$. 

Given that the robot is capable of performing three different 
behaviors, the task is to learn the models $M_{grasp}$, $M_{push}$, and 
$M_{drop}$. For each behavior, given a set of training examples 
$\{S_i, O_i\}$, where $i = 1, \ldots, N$, the robot uses a supervised 
machine learning algorithm to learn a model that can 
estimate $\Pr_B(O_i = o \mid S_i)$. The model can then be evaluated 
on novel data that was not used during the training stage. 
Three different machine learning algorithms are evaluated: 
k-Nearest Neighbor (k-NN), Support Vector Machine (SVM), 
and Bayesian Network. The three algorithms were chosen to 
represent three general families of machine learning models: 
instance-based (e.g., k-NN), discriminative (e.g., SVM), and 
generative (e.g., Bayesian Networks). 



B. Learning Algorithms 

1) K-Nearest Neighbor: K-Nearest Neighbor (k-NN) is an 
instance-based learning algorithm which simply stores all data 
points and their class labels and only uses them when the 
model is queried to make a prediction. The k-NN model falls 
within the family of lazy learning or memory-based learning 
algorithms [10, 11]. 

When asked to make a prediction on a test data point, k-NN 
finds the k closest neighbors of the query point and assigns a 
class label which is a smoothed average of the labels belonging 
to the selected neighbors. While k-NN is very simple to train 
and use, its performance can suffer when the input space contains 
irrelevant features, unless special distance or attribute weighting 
functions are introduced. 

In the experiments conducted in this study, k was set to 
3, and the WEKA [12] implementation of k-NN was used. 
An estimate for $\Pr_B(O_i = o \mid S_i)$ is obtained by 
counting the class labels of the k neighbors. For example, 
if two of those neighbors have class label Rubber Ball and 
one has class label Tennis Ball, then the estimated probability 
that the class label for the test point is Rubber Ball is 0.67; 
that of Tennis Ball is 0.33; and that of all other class labels is 0.0. 
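
The label-counting estimate can be sketched in a few lines. This is a simplified stand-in for illustration, not the WEKA implementation: it assumes plain Euclidean distance with no attribute normalization or distance weighting, and the function name is hypothetical.

```python
import math
from collections import Counter

def knn_class_probabilities(train, query, classes, k=3):
    """Estimate class probabilities as the fraction of the k nearest labels.

    train is a list of (feature_vector, label) pairs; Euclidean distance
    is an assumption made for this sketch.
    """
    # Sort training points by distance to the query and keep the k closest.
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    counts = Counter(label for _, label in neighbors)
    # Classes that do not appear among the neighbors get probability 0.0.
    return {c: counts[c] / k for c in classes}
```

With two Rubber Ball neighbors and one Tennis Ball neighbor, the function returns 2/3 and 1/3 for those classes and 0.0 for every other class, matching the example above.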

2) Support Vector Machine: The Support Vector Machine 
(SVM) classifier is a supervised learning model that falls into 
the family of discriminative models [13]. Given a set of labeled 
inputs $(\mathbf{x}_i, y_i)_{i=1,\ldots,l}$, with $\mathbf{x}_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$, training 
an SVM classifier is reduced to learning a linear decision 
function $f(\mathbf{x}) = \langle \mathbf{x}, \mathbf{w} \rangle + b$, with $\mathbf{w} \in \mathbb{R}^n$ and $b \in \mathbb{R}$, that can 
discriminate between positive ($+1$) and negative ($-1$) labeled 
inputs. The linear decision function $f(\mathbf{x})$ is learned by solving 
a dual quadratic optimization problem, where $\mathbf{w}$ and $b$ are 
optimized such that the margin of separation between the two 
classes is maximized [13]. 

For many problems, however, a good linear decision function 
$f(\mathbf{x})$ in the n-dimensional input space does not exist. 
In such cases, the labeled inputs can be mapped into a 
(possibly) higher-dimensional feature space, e.g., $\mathbf{x}_i \to \Phi(\mathbf{x}_i)$, 
where a good linear decision function can be found. The 
mapping is defined implicitly through the use of a kernel function 
$K(\mathbf{x}_i, \mathbf{x}_j) = \langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle$ that is subject to Mercer's 
Condition [13]. The kernel function can also be considered as 
a measure of similarity between two input data points. When 
using a kernel function, the output of the function between 
two instances (i.e., $K(\mathbf{x}_i, \mathbf{x}_j)$) replaces their dot product in 
the dual quadratic optimization framework (see [13, 14] for 
details). Hence, the actual higher-dimensional representation 
$\Phi(\mathbf{x}_i)$ need not be computed explicitly. 

In the experiments conducted, the polynomial kernel func- 
tion with exponent 2.0 was used [14]. The pairwise-coupling 
method of Hastie et al. [15] was applied to generalize the 
original binary classification SVM algorithm to the multi-class 
problem of object recognition. The SVM implementation in 
the WEKA machine learning library [12] was used, which 
implements the sequential minimal optimization algorithm for 
training the model [16]. To obtain a probabilistic estimate for 
the class label of a test data point, logistic regression models 
are fit to the outputs of the SVM, as described in [12]. 
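
The kernel substitution described above can be checked numerically for a polynomial kernel with exponent 2. The sketch assumes the homogeneous form $K(\mathbf{x}, \mathbf{z}) = \langle \mathbf{x}, \mathbf{z} \rangle^2$; the text does not say whether lower-order terms were included, so this form is an assumption, and both function names are hypothetical.

```python
import numpy as np

def poly_kernel(x, z, exponent=2):
    """Homogeneous polynomial kernel <x, z>^exponent (assumed form)."""
    return float(np.dot(x, z)) ** exponent

def explicit_map_degree2(x):
    """Explicit feature map for exponent 2: all pairwise products x_i * x_j."""
    x = np.asarray(x, dtype=float)
    return np.outer(x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])

# Kernel value computed in the 3-dimensional input space.
k_implicit = poly_kernel(x, z)
# Same value computed as a dot product in the 9-dimensional feature space.
k_explicit = float(np.dot(explicit_map_degree2(x), explicit_map_degree2(z)))
```

Both computations give the same number, which is why the higher-dimensional representation never has to be built: for 50-dimensional sound features the explicit map would have 2500 entries, while the kernel needs only one 50-term dot product.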

3) Bayesian Networks: A Bayesian network is a generative 
probabilistic graphical model that represents a set of variables 
and their probabilistic independencies [17]. Formally, a 
Bayesian network $\langle G, \Theta \rangle$ is defined over a set of variables 
$X = \{x_1, \ldots, x_n\}$, such that $G$ is a directed acyclic graph whose 
nodes represent the variables in $X$, and $\Theta = \{\theta_i\}$ represents the 
set of parameters defining the conditional probability of each 
node in the graph given its parents, i.e., $\Pr(x_i \mid x_{par(x_i)}, \theta_i)$, 
where $x_{par(x_i)}$ is the set of parent nodes of $x_i$ in $G$. Due 
to space constraints, the reader is referred to [17] for details 
regarding the Bayesian network model. 

A Bayesian network can be learned from a set of data by 
inducing the network structure $G$ and estimating the parameters 
$\Theta$ that maximize some particular objective function (e.g., 
log-likelihood). In this study, the WEKA [12] implementation 
for a Bayesian network was used, which learns the network 
using the hill climbing algorithm proposed by Cooper et 
al. [18]. All training and inference parameters available in 
the WEKA implementation were set to their default values. 
Because Bayesian networks are designed to work on discrete 
data, the numeric features of each sound are discretized using 
the discretization filter in the WEKA library [12]. 

C. Performance Evaluation 

The generalization performance of the models $M_{grasp}$, 
$M_{push}$, and $M_{drop}$ is estimated using leave-one-out 
cross-validation. Let $\{S_i, O_i\}$, where $i = 1, \ldots, N$, be a set of data 
for some given behavior $B \in \mathcal{B}$. During each iteration of the 
cross-validation procedure, one data point from the set is used 
for testing and the remaining $N - 1$ data points are used for training 
the model $M_B$. For each behavior, there are six data points for 
each of the eighteen objects, resulting in $N = 6 \times 18 = 108$. 

The performance of the models is reported in terms of the 
percentage of correct predictions, i.e., accuracy, where: 

$$\%\ \text{Accuracy} = \frac{\#\ \text{correct predictions}}{\#\ \text{total predictions}} \times 100$$
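
The evaluation loop can be sketched generically as below. The nearest-neighbor predictor is only a placeholder so that the example runs on its own; it is not one of the WEKA models used in the study, and both function names are hypothetical.

```python
import math

def nearest_neighbor_predict(train_X, train_y, test_x):
    """Placeholder classifier: return the label of the closest training point."""
    best = min(range(len(train_X)), key=lambda j: math.dist(train_X[j], test_x))
    return train_y[best]

def leave_one_out_accuracy(features, labels, fit_predict):
    """Hold out each point once, train on the other N - 1, and report % accuracy.

    fit_predict(train_X, train_y, test_x) must return a predicted label.
    """
    n = len(features)
    correct = 0
    for i in range(n):
        # Training set is everything except the i-th data point.
        train_X = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_X, train_y, features[i]) == labels[i]:
            correct += 1
    return 100.0 * correct / n
```

Passing the classifier as a function keeps the loop independent of the learning algorithm, so the same procedure can score any of the three model families.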

To evaluate whether multiple different interactions with 
the same object improve prediction, the predictions of the 
models $M_{grasp}$, $M_{push}$, and $M_{drop}$ are aggregated using equal 
weight as follows. Let $S_i^{grasp}$, $S_i^{push}$, and $S_i^{drop}$ be the sounds 
generated when applying the three behaviors to the same 
object $O_i$ during the $i$-th trial, such that none of the three 
sounds appears in the training sets for the three different 
models $M_{grasp}$, $M_{push}$, and $M_{drop}$. Once the models for each 
behavior $B \in \mathcal{B}$ are trained, the combined prediction is 
assigned to the object class $o \in \mathcal{O}$ that maximizes: 

$$\sum_{B \in \mathcal{B}} \Pr_B(O_i = o \mid S_i^B)$$

The goal of this procedure is to determine whether applying 
multiple distinct behaviors to the same object will result in 
better prediction performance. The next section summarizes 
the results and compares the performance of the three different 
learning algorithms used in the experiments. 
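
The equal-weight aggregation can be written as a short sketch. It assumes each per-behavior model has already produced a probability for every candidate object; the function name and the dictionary layout are hypothetical, and the probabilities in the usage note are made-up numbers for illustration only.

```python
def combine_predictions(per_behavior_probs):
    """Sum per-behavior class probabilities and pick the highest-scoring object.

    per_behavior_probs maps behavior name -> {object name: probability}.
    Returns (best_object, summed_scores).
    """
    scores = {}
    for probs in per_behavior_probs.values():
        for obj, p in probs.items():
            # Every behavior contributes with equal weight.
            scores[obj] = scores.get(obj, 0.0) + p
    best = max(scores, key=scores.get)
    return best, scores
```

For instance, if a push model gives 0.25 to the rubber ball and 0.75 to the tennis ball, a drop model gives 0.5 to each, and a grasp model gives 1.0 to the rubber ball, the summed scores are 1.75 and 1.25, so the confident model resolves the ambiguity left by the other two.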

V. Results 

Table I shows the performance rates of the models $M_{grasp}$, 
$M_{push}$, and $M_{drop}$, along with the combined model, when using 
the three different learning algorithms: k-NN, Support Vector 
Machine (SVM), and Bayesian Networks. As a point of 
reference, a random predictor would achieve about $(1/18) \times 
100\% = 5.6\%$ accuracy, given that $|\mathcal{O}| = 18$. 

A. Comparison of Learning Algorithms 

The first observation is that all three learning algorithms 
can recognize the objects in the test set significantly better 
than chance. Overall, the Bayesian Network learning model 
significantly outperforms k-NN and SVM, resulting in high 
accuracy rates, with the exception of the $M_{drop}$ model, whose 
performance is similar for both the SVM and Bayesian 
Network learning models. 

A possible reason why the memory-based k-NN model does 
not perform well is that there are irrelevant features in the 
sound feature vector. This is likely due to the background 
noise produced by the ventilation and air conditioning systems 
in the lab. The discriminative SVM model also suffers from 
this drawback when using standard kernel functions (such as 
the polynomial or Radial Basis Function kernels). In addition, 
empirical results suggest that generative models (such as 
Bayesian Networks) achieve their asymptotic error rates with 
less data than their discriminative counterparts, thus 
making them preferable when training data points are scarce 
[19]. Given that for each behavior there are only 6 data points 
per object class, the amount of data recorded is relatively 
small which may be another reason why SVM was not able 
to outperform the Bayesian Network model. 

B. Single vs. Multiple Behaviors 

The results in Table I show that object recognition based 
on auditory information is most difficult when using the 
dropping behavior, i.e., many objects sound very similar (given 
the feature representation used) when dropped. The grasping 
behavior, on the other hand, produces sound feature vectors 
that are most informative of the object being grasped (when 
using SVM and Bayesian Network). 

In the case of the push behavior, the Bayesian Network 
model makes several prediction mistakes. For example, the 
rubber ball gets mis-classified as a tennis ball in three out of 
six trials, while the tennis ball gets mis-classified as being a 
plastic ball in all 6 trials. The soft plastic cup is also mis-classified 
as a hard plastic cup in two out of four trials, 
while the pop can gets classified as a plastic box once. These 
mistakes show that given the feature representation, there are 
pairs of objects that sound very similar when a given behavior 
is applied to them. This was true for all three behaviors. 

The Bayesian Network combined model, $M_{combined}$, makes 
only three prediction errors: the whiteboard eraser is 
mis-classified as a hockey puck once, the tennis ball is 
mis-classified as a rubber ball once, and the rubber ball is 
predicted as being the tennis ball in one out of the six 
trials. The low error rate shows that using multiple different 
interactions (i.e., different exploratory behaviors) with the 
object significantly boosts prediction performance for all three 
learning algorithms. This indicates that the errors that the three 
models, $M_{grasp}$, $M_{push}$, and $M_{drop}$, make are uncorrelated. 
As noted earlier, it is very difficult to discriminate between 
the tennis ball and the rubber ball based on the sounds 
they make when applying the push behavior. However, the 
$M_{grasp}$ model achieves perfect classification for these two 
objects, thus helping resolve the ambiguity between them. 
Similarly, the $M_{drop}$ model cannot distinguish well between 
the metal plate and metal flange objects, while the $M_{push}$ 
model achieves perfect classification for these two objects. 

C. Performance vs. Amount of Training Data 

The Bayesian Network models (i.e., the models that achieve 
the best performance for this task) were also evaluated by 
varying the amount of data available in the training set. 
Figure 5 shows the accuracy rates as the number of trials with 
each object is varied from two to six. Even with just two trials 
per object (i.e., one in the training set and one in the testing 
set), the models $M_{grasp}$ and $M_{push}$ can predict the object 
class significantly better than chance. As expected, access to 
more training data leads to improved accuracy for all models. 

Finally, the performance of the three different learning 
algorithms was also evaluated as a function of the amount of 
training data that was used. Figure 6 shows the performance 
of the k-NN, SVM, and Bayesian Network algorithms for the 
model $M_{grasp}$ as the number of trials per object is varied 
from 2 to 12 (six additional grasping trials were recorded for 
each object during the same recording session). The Bayesian 
Network model reaches its asymptotic error rate much more quickly 
than the k-NN and SVM learning algorithms. 

Fig. 6. Performance of the model $M_{grasp}$ using the three different learning 
algorithms as the number of trials with each object is varied from 2 to 12. 

VI. Conclusions and Future Work 

The study presented in this paper investigated how a robot 
can use auditory information to recognize the object it is inter- 
acting with. The proposed framework uses machine learning 
methods and different exploratory behaviors, which allow the 
robot to predict the object class (one of eighteen possible 
objects) given the detected sound and behavior. 

The framework presented here uses standard machine learn- 
ing algorithms to solve the task. Three such algorithms were 
evaluated: k-Nearest Neighbor, Support Vector Machine, and 
Bayesian Network. The algorithms represent three major fam- 
ilies of machine learning models: instance-based, discrimi- 
native, and probabilistic generative graphical models. While 
all models performed significantly better than chance, the 
Bayesian Network model was able to achieve the highest level 
of accuracy with the least amount of training data. Unlike the 
standard SVM and k-NN models, the Bayesian Network model 
is able to detect the irrelevant features in the sound feature 
vectors produced by background noise (e.g., due to the air 
conditioning system in the lab). 

The robot used three different exploratory behaviors to 
interact with the objects: grasping, pushing, and dropping. For 
each of the three behaviors, there were pairs of objects which 
sounded very similar (in terms of the feature representation 
that was used), e.g., the tennis ball and the rubber ball were 
almost indistinguishable when being pushed to the side. How- 
ever, by applying multiple behaviors to the same object, the 
predictions of the models for each behavior can be combined, 
resulting in higher accuracy rates than for any single 
behavior performed alone. These results were consistent for 
all three learning algorithms evaluated in this study. 



While the number of objects used in this study was relatively 
high (in comparison to other related work in the field, see 
Section II), it is still a challenging task to classify the hundreds 
of different objects found in human environments based solely 
on their acoustic properties. Different initial configurations 
between the object and the robot introduce additional difficulties 
to the problem. One possible approach to be investigated in 
future work is to use unsupervised methods (e.g., hierarchical 
clustering) in order to learn categories of object sounds. 
Such an approach would allow the robot to use standard 
machine learning methods for each object category (which 
would contain a smaller set of objects than the total set to 
which the robot is exposed) while at the same time learn how 
the categories relate to each other. Also, more powerful feature 
representations may be used in order to capture and model the 
temporal patterns in some object interactions (e.g., periodicity 
of bouncing). 